74 research outputs found

    Variation in the Marking of Text Organization in French Research Articles: From Short and Specific to Extended and Vague

    Adopting a text-linguistic, corpus-based approach, this article studies variation in the marking of text organization. Why is text organization sometimes signaled very precisely, and sometimes not signaled at all? The focus is on a particular mode of text organization, text sequences, i.e. structures at least partially signaled by markers of addition or order, such as "first" or "the last example". The material consists of 90 research articles in French with manually XML-annotated text sequences (XML = Extensible Markup Language). The results highlight the variation in the marking and show several factors affecting it. In shorter sequences, the marking is typically explicit and precise, while in longer ones explicit marking is more often omitted; when markers are used, they are only vague ones signaling simple addition. In addition, different markers tend to be used to signal sequences of different lengths.

    Les discussions Wikipedia : un corpus pour caractériser le genre « discussion »

    This presentation describes the intra-linguistic characteristics of Wikipedia talk pages, the discussion forum attached to each article of the Wikipedia encyclopedia. After outlining the properties that make these texts a particularly interesting object of study for corpus linguistics, we present the procedure used to build the discussion corpus and a first quantitative description of the resulting corpus. We conclude with a brief overview of a set of linguistic studies planned on this corpus.

    D’abord, ensuite, enfin et 0, De plus: Organisation textuelle par des séries linéaires dans les articles de recherche

    The study examines the signalling of text organisation in research articles (RA) in French. The work concentrates on a particular type of organisation provided by text sequences, i.e. structures organising text into items of which at least some are signalled by markers of addition or order: First… 0… The third point… In addition… / Premièrement… 0… Le troisième point… De plus… By indicating the way the text is organised, these structures guide readers through the reading process so that they do not need to interpret the text structure themselves. The aim of the work is to study factors affecting the marking of text sequences. Why is their structure sometimes signalled explicitly by markers such as "secondly", whereas in other places such markers are not used? The corpus is manually XML-annotated and consists of 90 RAs (~800 000 words) in French from the fields of linguistics, education and history. The analysis highlights several factors affecting the marking of text sequences. First, exact markers (such as "first") seem to be more frequent in sequences where all the items are explicitly signalled by a marker, whereas additive markers (such as "moreover") are used in sequences with both explicitly signalled and unmarked items. The marking of explicitly signalled sequences thus seems to be precise and even repetitive, whereas the signalling of sequences with unmarked items is altogether more vague. Second, the marking of text sequences seems to depend on the length of the text. The longer the text segment, the more vague the marking. Additive markers and unmarked items are more frequent in longer sequences, possibly covering several pages, whereas shorter sequences are often signalled explicitly by exact markers. The marker types also vary according to the sequence length. Anaphoric expressions, such as "first", are fairly close to their referents and are used in short sequences; connectors, such as "secondly", are frequently used in sequences of intermediate length; and the longest sequences are often signalled by constructions composed of an ordinal and a noun acting as the subject of the sentence: "The first item is…". Finally, the marking of text organisation also depends on the discipline the RA belongs to. In linguistics, the marking is fairly frequent and precise; exact markers such as "second" are the most used, and structures with unmarked items are less common. Similarly, the marking is fairly frequent in education. In this field, however, it is also less precise than in linguistics, with frequent unmarked items and additive markers. History, on the other hand, is characterised by less frequent marking. In addition, when used, the marking in this field is also less precise and less explicit.

    Korpusaineistot

    Peer reviewed

    Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 65-72. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206

    Affectivity in the #jesuisCharlie Twitter discussion


    Predictive keywords: Using machine learning to explain document characteristics

    Keyword analysis is widely used in corpus linguistics to explore the characteristics of the discourse domain associated with a set of texts. However, one of the challenges facing this method is the evaluation of the quality of the keywords. Here, we propose casting keyword analysis as a prediction problem with the goal of discriminating the texts associated with the target corpus from the reference corpus. We demonstrate that, when using linear support vector machines, this approach can be used not only to quantify the discrimination between the two corpora, but also to extract keywords. To evaluate the keywords, we develop a systematic and rigorous approach anchored to the concepts of usefulness and relevance used in machine learning. The extracted keywords are compared with the recently proposed text dispersion keyness measure. We demonstrate that our approach extracts keywords that are highly useful and linguistically relevant, capturing the characteristics of their discourse domain.

    French Wikipedia Talk Pages: Profiling and Conflict Detection

    Wikipedia is a popular and extremely useful resource for studies in both linguistics and natural language processing (Yano and Kang, 2008; Ferschke et al., 2013). This paper introduces a new language resource based on the French Wikipedia online discussion pages, the WikiTalk corpus. The publicly available corpus includes 160M words and 3M posts structured into 1M thematic sections and has been syntactically parsed with the Talismane toolkit (Urieli, 2013). In this paper, we present the first results of experiments aiming at classifying and profiling the talk pages and threads in order to determine criteria for selecting discussions with conflicts.